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Abstract 

This paper presents a spectral element finite element scheme that efficiently 
solves elliptic problems on unstructured hexahedral meshes. The discrete 
equations are solved using a matrix-free preconditioned conjugate gradient 
algorithm. An additive Schwartz two-scale preconditioner is employed that 
allows h-independence convergence. An extensible multi-threading program¬ 
ming API is used as a common kernel language that allows runtime selecti 
on of different computing devices (GPU and CPU) and different threading 
interfaces (CUDA, OpenCL and OpenMP). Performance tests demonstrate 
that problems with over 50 million degrees of freedom can be solved in a 
few seconds on an off-the-shelf GPU. 
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1. Introduction 

Recent research efforts HJGSj have led to the development of 3D hex- 
dominant mesh generation systems that are fast and reliable. It is now 
possible (e.g. with Grnsh m) to create meshes of general 3D domains that 
contain over 80 % of hexahedra in volume in a fully automatic manner. 

We foresee that fully automatic hex-meshing procedures will be available 
in the next decade. This perspective allows finite element researchers to 
reconsider some commonly held beliefs, namely that tet-meshing may not 
remain the only solution for automatic mesh generation. 

Quadrilateral meshes in 2D and hexahedral meshes in 3D are usually 
considered to be superior to triangular/tetrahedral meshes. There are nu¬ 
merous modeling reasons to prefer hexes: boundary layers in CFD [22]. 
inaccuracy or locking problems in solid mechanics [2], 
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From a high order spectral finite element perspective, hex meshes pro¬ 
vide considerable advantages. First, although this is not specific to spectral 
finite elements, a hex mesh contains about seven times fewer elements than 
a tet mesh with the same number of vertices. Fewer elements mean less data 
storage and a faster assembly procedure. Taking advantage of the inherent 
tensor-product structure of hexahedral basis functions one can dramatically 
reduce the number of floating point operations for computing finite element 
operators. The local cartesian structure of the mesh provides natural over¬ 
lapping patches of elements that enables the construction of efficient local 
preconditioners. Finally, spectral hex-meshes can achieve relatively high 
throughput on GPUs following the approaches described below. 

The use of GPUs for accelerating finite element solvers for elliptic prob¬ 
lems is of course not new. In early work Goddeke et al m investigated seal- 
ability of finite element solvers on GPU clusters. Later Goduke described 
multigrid methods for finite element methods on GPU clusters [12]. Cecka 
et al [3] and Markall et al [16] discussed algorithms for efficient stiffness ma¬ 
trix assembly on GPUs. Knepley et al |15] described algorithms for efficient 
evaluation of finite element integrals on GPUs. Gaveled et al nn introduced 
a finite element toolkit that integrates geometric multigrid techniques with 
sparse approximate inverse algorithms on GPUs. Furthermore, pushing the 
envelope of GPU based finite element software design Fu et al [7] describe 
a systematic approach to pipelining finite element methods. Largely these 
prior approaches have focused on optimizing the process of stiffness matrix 
assembly. The current work differs by first using a high-order finite element 
approach and secondly adopting a matrix-free approach that in its lean¬ 
est form only requires storage for mesh vertex coordinates, residual vector, 
solution vector, load vector, and indexing arrays. 

In this paper, we propose a numerical scheme that allows us to solve 
Poisson-like problems on unstructured all-hex meshes using the massive 
multi-threading capacities of modern computer hardware. An extensible 
multi-threading programming API is used as a common kernel language m 
to try our numerical scheme on different devices (GPU and CPU) and using 
different thread programming interfaces (CUDA, OpenCL, and OpenMP). 

This paper is structured as follows. In )|2j standard properties of spectral 
finite elements are presented in brief. The numerical method is presented 
in ^3] and f|4j preconditioned conjugate gradients are used for solving linear 
systems. A two-scale additive Schwartz preconditioner is used for acceler¬ 
ating the convergence. Details of implementation are presented in Ij5] and 
results are presented in <|6] 


3 


2. Spectral Finite Elements on Hexahedral Meshes 


Consider a domain 17 G i? 3 with boundary F = r^UlV and the following 
model problem: find u(x, y, z) that satisfies 


cu — V • ( kVu ) 

= s 

on 

u = 

= U 0 

on 

du 



dn 

= 9 

on 


n, 


r D 

(i) 

r N 

(2) 


where c(x, y, z) > 0, k(x, y, z) > 0 and s(x, y, z) is a given source term. We 
further suppose that s, uq and g satisfy the standard regularity assumptions 
and, without loss of generality, that uq = 0. A weak formulation of ([2]) is: 
find u G Hq(FI) that satisfies 



■Vw + cu w\ dxdydz 


r w dxdydz 


Vw G Hq(FI) 


( 3 ) 


where Hq(Q) = {«£ i/ 1 (17), u|r = 0}. 


2.1. Interpolation 

Consider now a mesh constructed of unstructured hexahedra. On each 
hexahedron e, the finite element interpolation basis is a tensor products 
of one dimensional basis of P n that are the set of Lagrangian interpolants 
4>j(t), j = 0,..., n on the Gauss-Lobatto Legendre (GLL) quadrature points 
in the reference domain: t 7 e [—l,+l],i = 0, ...,n,(j>j(ti) = Sij [4j. 

In the reference hexahedron G [—1,+1] of element e, fields are 

interpolated as 


Uijk^eMO&jivHki C) ( 4 ) 

i=0 j =0 k =0 


where u^k-e are the values of u at the (n+ l) 3 nodes of element e. We define 
the derivation matrix D following [0] as 


Dij — 


d<pi 

dt 


t=tn 


( 5 ) 
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2.2. Local and Global vectors 

Consider a mesh made of N E unstructured hexahedra with a total of N 
GLL nodes and a scalar field u interpolated on the mesh. In the following, 
two representations of u will be used, one that is defined locally to one 
element and a second that is defined globally on the mesh. The local version 
of u is denoted by 


Uijk-e, 0 <i,j,k <n, 1 < e < N E . 

A global indexing of the GLL nodes is defined that associates a unique 
number to every GLL node A f of the mesh. The global version of u is noted 

u_\r , 1 < A f < N. 

A local-to-global indexing table M = Ii^, g [ijk] e) provides a global index J\f 
for local node ( ijk ; e). A scatter operation [6] consists of building the local 
representation of a vector from its global representation: 

u ijk-,e ( 6 ) 

The gather operator allows to compute global vectors from local vectors. 
It requires the definition of a global-to-local indexing table I g ^i(N) that 
returns the Nj\f local nodes e A/i) that are associated to a given 

global node A f: 

Ii^g{I g ^i{M)) = eu t ) = M, 1 = 1,..., Ntf. ( 7 ) 


2.3. Geometry 

The unstructured nature of the hexahedral meshes that are considered 
here force us to consider the geometry of each individual element e. The 
geometry of element e is defined through its mapping 


y e {t,v, 0 , z e (£,v,0 


between the reference cube £, 77 , £ 6 [—1,+1] and the element e. We define 


the Jacobian 


J e (H,yX) 


dx e 

dx e 

dx e ' 

dZ 

dr) 

d( 

dy e 


dy £ 

d£ 

dr) 

d( 

dz e 

dz e 

dz e 

dt 

dr) 

d< ] 


its determinant \J\ e = det J e and the symmetric metric tensor G e 


(J e ) T J e . 
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2-4- Integration 

In the spectral element method formulation GLL points are both used 
for interpolation and integration purposes: 


L 


f(x, y, z)dxdydz — EEE fijk m ,e PiPjPk\J\ijk\e 

i=0 j —0 k =0 


Hjk',e 


where the pj's are ID integration weights. GLL points are sub-optimal 
integration points: they only allow to exactly integrate a polynomial of 
order 2n — 1 

The computation of Q is performed in two steps. Local values of the 
residuals are computed at every GLL point of every element: 


Tijk\e ijk\e 


ftijk\e ^^Aijk\e * ^ “1“ ^Hjk\e^ijk]e 


Then local residuals r^k-e are gathered to global GLL nodes A f using ([7]) 


as 


Nu 

r M = ^2 r *A/bA^Af;e- 
l=i 


Equations © and 0 allow to compute 

du 


du 

~di 


^ ' Dim,nmjk\e 
ijk\e m =0 


dr/ 


— ^ ^ DjmUimk-,ei 
m=0 


du 

dC 


^ ^ Dkin^ 

m =0 


ijm\e • 


Thus, rijk-e is computed as 

Tijk\e ^7-zjfc;e [ 

n 

^ ^ ftmjk\e Dm 


m =0 
n 


E 

m =0 
n 

E 

m =0 


ftimk\e D m j 

Cijk\e'U j ijk\e } 


mjk m ,e 


T imk\e 


r 3 9u 

^ijnv.e 


du 

9 du 

q du 


dt 

mjk\e • 

r*o _ 

' ^mjk^e q/- 
mjk\e ^ 

mjk\e 


+ 


du 

^. du 

^ du 



_i_ 

Ksr imk-,e o 
imk\e ^ > 

fio _ 

' ^imk^e 

imk\e 

imk;e 


+ 


ijm-,e 


, r 5 

“r Cr ijm-e 


i jm;e 


. r 6 — 

“r *-'■ zjm;e 


ijm-,e 


+ 


( 8 ) 
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3. Preconditioned Conjugate Gradients 

The aim now is to solve our problem, i.e. find u solution of 
rj\f(u) = 0, 1 < A f < N. 

For that, we use preconditioned conjugate gradients (PCG) because of the 
symmetric positive definite nature of our problem. Algorithm [T] describes 
the PCG procedure. In this description, V a preconditioner. The two 


Algorithm 1: Preconditioned Conjugate Gradients 
u 0 = 0 
r 0 = r(uo) 
zo = V(r 0 ) 


Po = z 0 

for k = 0,1, 2,... do 

/ = r(p k ) 


z I r k 
a k = — 


Pk f 

r U j k+ 1 = ^ k &kPk 

Pk -\-1 ^k &kf 
Zk +1 = Vr k+ 1 

a z k+i r k+i 

Pk+1 = — tt— 


H r k 


Pk+1 ~ Z k+ 1 + f3 k+ ip k 


expensive steps that are computed at each iteration are actually computing 
r(p k ) and computing Vr k+ \. 


4. A two scale preconditioner 

We use here an additive Schwarz preconditioner P using overlapping 
subdomains [S] plus a coarse grid projection operator. Preconditioner V is 
based on solving (i) a coarse problem with low order (n = 1) elements on the 
mesh and (ii) local problems on overlapping subdomains. The contributions 
of the coarse and fine preconditioners are subsequently added: 


V = v c + v f . 
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4-1- Coarse grid preconditioner 

In short, the coarse grid preconditioner works as follows 

V c r = V T V 0 Vr. 

V is a restriction operator that projects the high order residual onto the 
coarse space. In our implementation, Vq consists of algebraic multigrid 
cycles on the finite element matrix computed at polynomial order n = 1. 
The prolongation operator V T is chosen as the transpose of V in order to 
preserve the symmetry of the problem. 

The essential ingredient of this multilevel is the restriction operator. In 
m authors show that L 2 projection is a good choice for V. In that specific 
case: 

V = Cm -1 

where m is the fine scale mass matrix and where C the “mixed” mass or 
correlation matrix. Note that m is the lumped diagonal mass matrix when 
using spectral finite elements: the m_v’s are the global components of the 
local weights make- 

Nm 

m N — m wjATfcAf;e- 
1=1 

The application of V c to the residual r proceeds as: 


V c r = m 1 C T VoCm V. 




(9) 


R 

Assume that capital letters indices I, J and I\ are coarse indices: /, J, K = 
0,1. Local correlation matrices are computed as 

Cijk,iJK-e = J 4>i(j)j<j)k$i$j$Kdxdydz. 

where the <I>’s are coarse ID shape functions. Using fine scale GLL points 
for integration, we have 

CIJ K,ijk\e — mijk-e- 

Vandermonde matrix 


Bl.JK,ijk = $l(ti)®j(tj)® K (tk) 
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of size 8 x n 3 is computed once and stored. 

Restriction of the global residual rj^ starts by scaling it by the inverted 
mass matrix to form yjy = rj\f/mjy and to use ([6]) to form the local vector 
Uijk-e■ Then, yijk-e is used to form the local coarse residual (see ®»: 

n n n 

RlJK-e = EEE BlJK,ijk Uijk;e ijk\e• 

*=0 j =0 k =0 

Local coarse residual RijK.e is gathered to the coarse grid nodes to form 
the global coarse residual R. Coarse preconditioner Vq is then applied to 
get the global vector Z = PqR. Global vector Z is then scattered to form 
ZjjK-e■ We finally prolongate Zjjx-e as 

i i i 

Zijk;e ~ EEE BlJK,ijk ZjjK-e rriijk-e- 

1=0 J=0K=0 

and scatter Zijk- e to form its global version zjy. Global vector yjg- is finally 
scaled by 1 /mjy to form the coarse scale correction V c r\j^ = ztf/mjy. 

4-1.1. Algebraic multigrid 

An aggregation based algebraic multigrid method is used to obtain an 
approximate solution of coarse grid system. At first, the coarse grid matrix 
is computed using finite elements with polynomial order n = 1. Then, a hi¬ 
erarchy of matrices are constructed using an unsmooth aggregation method 
|8|. An approximate solution for the coarse grid problem is obtained by the 
application of a recursive K-cycle m along with damped Jacobi smoothing 
at each level in the hierarchy as described in [8]. Experimental results show 
that the algebraic multigrid method provides h-independence convergence. 

4-2. Local preconditioner 

An overlapping additive Schwartz approach is used to build the local 
preconditioner. A set of Ne overlapping subdomains is defined. Each sub- 
domain s is an element of the mesh with one layer of nodes overlapping 
neighbor elements on all its faces (Figure [l|, with a total of (n + 3) 3 GLL 
nodes per subdomain. 

The restriction of rjy on subdomain s is denoted by rijk-s , — 1 < i, j, k < 
n + 1. It is computed using a table lf_^ g (ijk: s ) that allows to scatter global 
vectors to the vertices of the subdomains: 


fijk-,s - r i?=y g (ijk;s) • 
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Figure 1: Fine scale stencil for the local preconditioner. Left: GLL nodes shown super¬ 
imposed on physical elements. Right: topological relationship of GLL nodes on reference 
element. 

The principle is to solve problem ([3]) on each subdomain i.e. find the fine 
scale correction Zijk-b that is the solution of ^ with r^k-g as right hand side 
and with suitable boundary conditions. 

We make an assumption that greatly simplifies the computation of fine 
scale corrections, especially for the case of unstructured meshes. Fine scale 
corrections are computed without taking into account geometric factors i.e. 
assuming that hexahedra’s faces and edges are aligned with the axis of co¬ 
ordinates x, y and £. Each hexahedra is approximated as a parallelogram of 
dimensions h x , h y and h z . This geometric simplification actually allows to 
compute coarse scale corrections in an efficient way. 

Consider a ID pencil of size h x with 

n 

u(t) = 

i=0 

with x = htl/ij. and Ui = u(ti). Assuming k and c constant per element, the 
ID element matrix corresponding to ([3]) is 

f hx f dd>i dfij \ 2 k ch x . 

Kij = / ) dx - ( DimDjmPm ) H-— PjSij. (10) 

W—0-^, 

dij 

where dij is the discrete second derivative. 
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Figure 2: The ID pencil. 

Let’s extended the pencil with one GLL vertex on its left (point —1 on 
Figure [2]) and on its right (point n + 1 on Figure [ 2 ]). The stiffness operator 
on the extended domain is computed using standard finite element assembly 
procedure i.e. adding contribution of vertex —1 and n + 1 in the finite 
element matrix. Homogeneous Dirichlet boundary conditions are computed 
on points —2 and n + 2 so that the final matrix hj, —1 < i,j < n + 1 is 
invertible. The ID finite element problem is finally written as 


4 k 

hi 


dll 

^10 


V 


or in matrix form 





Pi 

" 



2-1 

" 

2doo 

don 

+ c 

2 po 




2 0 

II 

dnO 

2doo dm 



2 P 0 

) 


z n 


d 10 d 11 



Pi 

/ 


. Z n +1 _ 



ywK + cM] z = ^-r —)■ (yyM l K + cl) z = j—M 1 r. 

hi ) h x \hi ) h x 


Matrix L = M 1 K is diagonalizable 


L = V~ X AV 


and has real and positive eigenvalues Aj. The diagonalization of L leads to 
the explicit solution: 


z = 


V 


-1 


4 k 

K 


A + cI 


-1 


V 




In 3D, the local subproblem consist in finding the fine scale correction Zijk-b 
solution of 


4 k 


n+l 

E 

m =—1 


h1 Lim 


z mjk\s H - LjmZimk’, 8 H“ 


+ cz ijk\s ~ r ijk-,s 


with 


r ijk;s 


h x hyh z MaMjjMhk 


'f'ijk\S' 






















11 


It is then possible to compute the solution as we did in ID: 

n+1 n+1 n+1 


z ijk;s ~ ^ D?e 

d=—l e=—1 /=-l 


1 


+ % + i)+ C 


n+1 n+1 n+1 


E E E VadYbeVcffabc-,s 


( 11 ) 


a=— 1 6=—1 c=—1 

The fine correction for subdomain s is added to the fine preconditioner 


V f 


+. g (b'fc;s) 


V f 




+ z ijk -s , i,j,k e[-l,n + l] 3 . 


5. Implementation 

Our implementation uses OCCA which is a novel approach that includes 
a unified kernel language that expands to multiple different threading lan¬ 
guages m that has recently been demonstrated to be an effective platform 
for implementing discretizations of hyperbolic problems [S HU H9] • The two 
dominant cost kernels are the computation of residual and the fine grid 
preconditioner z t jk ;s (see (HD- 

General purpose parallel programming on GPUs allow a two-level paral¬ 
lelism. GPU devices have a large number of multiprocessing units that can 
run threads. Threads are organized in blocks and whole thread blocks are 
executed by a multiprocessing unit. The global device memory of the GPU 
is accessible by all the threads of all blocks. Some memory is only accessible 
by a given block of threads (shared memory) and has relatively low latency 
compared to global device memory. In our implementation, each element is 
assigned to a thread block and residual calculations at each GLL node are 
assigned to one thread. 

5.1. An OCCA kernel for computing r^k-e 

Algorithm [2] describes our implementation for computing r^k-e- variables 
with superscript s implies that the data is stored in on-chip shared memory 
accessible to the threads processing element e. 

Parameters k and c that are element-based are copied in shared memory, 
as well as the derivative matrix D^ and the current solution u. Then, (n + 
l) 3 threads are launched in order to compute (i) derivatives of u (2x3x (n+1) 
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Algorithm 2: Computation of 


r ijk;t 


for e = 1,..., IV# do 

K S e = K e , C% = C e . 
for i.j E [0, n] 2 do 

L Dfi = Dir 

for i.j. k E [0, n] 3 do 

|_ u ijk;e = u Ii-> g (ijk;e)- 

barrier(); 

for i,j,k E [0,n] 3 do 


du 

at 

du 

d( 


, S On 

m =0 
n 

= V D S 7/ S 
, Z-, km ijm:e' 

Vk',e m=0 




<9u 


= E D 


ijk;e m=0 


jm imk\e'> 


L J^ m=U 

Compute A| ifc;e , B? jk . e , C s ijk . e . as in @ 
barrier(); 
for i,j, k E [0, n] 3 do 


n 

^ijk\e ^e ^ ^ ^mj^imk\e ^m\fiijm\e\ ^e^ijk^eT^ijk^t 

m =0 
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operations per thread), (ii) gradients 


A s , 

^ijkr.e 


JDS 

ijk:e 


(-IS 

^ijk:e 


1 

du 




du 




du 


1 

ijk m ,e 

di 

ijk-,e 

+ 

^ijk^e 

drj 

ijk\e 

+ 

(^9 

^ijk^e 

d( 

ijk\e 

o 

du 




du 




du 


Z 

ijk\e 

di 

ijk;e 

+ 

^ijk^e 

di] 

ijk\e 

+ 

(-10 

^ijk^e 

d( 

ijk\e 

q 

du 




du 




du 


o 

ijk\e 

di 

ijk\e 

+ 

n? 

^ijk^e 

drj 

ijk;e 

+ 

(-iO 

^ijk^e 

d( 

ijk\e 


( 12 ) 


(3x5 operations per thread). Finally, the residual is computed at every 
GLL node (2 x 3 x (n+ 1) +3 operations per thread). Two barriers separate 
the three main parts of the algorithm. The maximum number of threads 
per thread block in CUDA is currently 1024, our kernel can only be used 
up to n = 9 for which (n + l) 3 = 1000. In OpenCL the number of work- 
items per work-group is limited to 256 on AMD GPUs and consequently we 
are limited to n = 5 in that case. The total number of operations Or for 
computing Tijk- e is 

Or = N e x [12 x (n + l) 4 + 18 x (n + l) 3 ] . 


Memory bandwidth quantifies the rate of throughput of data transfer be¬ 
tween GPU and the GPU global device memory. In modern GPU architec¬ 
tures, a typical value for the memory bandwidth is 250 GB/sec i.e. about 
75 billion floats per second. The total number of floating point numbers 
transferred from the global memory to the shared memory here is 

Br = Nr x [10 x (n + l) 3 + (n + 1)^ + 2] 

where the dominant term 10 x Nr x (n + l) 3 correspond (per GLL point) 
to the 7 geometric factors G™ fc . e , 1 < m < 6 and the weighted masses m^k-e, 
to the solution Uijk- e , to the residual itself rijk-e and to the table Ii^ g (ijk\ e) 
that is used for scattering uj\f. 

At this point, it is interesting to look at the ratio Or/Br Any mod¬ 
ern GPU can theoretically deliver over one TeraFlop while its bandwidth is 
limited to 250 GB/sec. This means that the value Or/Br should be over 
20 to ensure that the computation is not limited by bandwidth limitations. 
Here, ratio Or/Br is equal to seven for n = 4: we expect the performances 
of our kernel to be strongly limited by the GPU bandwidth, especially at 
low orders. It is therefore interesting to consider an alternative approach 
where the seven geometric factors are computed on the fly. In this case, the 
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only geometric data that have to be shipped to the GPU are the position of 
the vertices of the mesh as well as the connectivity table of the hexahedra. 
More operations are required to compute the geometric factors: more pre¬ 
cisely, 242 floating point operations per GLL node are required to compute 
all seven geometric factors. In what follows, Or will be computed in two 
different manners, whether the extra computations that have been done for 
computing geometric factors are or are not not taken into account. In this 
new scheme, memory bandwidth is indeed reduced to 

Br = Nr x [3 x (nil)'* + (n+ l) 2 + 2] 

5.2. An OCCA kernel for computing z^s 

The local preconditioning step that computes^*,;,; is the remaining ex¬ 
pensive part of the algorithm. Each subdomain s is assigned to a thread 
block. Then, (n + 3) 3 threads are launched in every block in order to com¬ 
pute Zijk-s, which limits the order to n = 7 for CUDA and n = 3 for OpenCL. 
Equation © explicitly provides a formula for Pijk- S • The most significant 
computations are the six successive products with the left and right eigen¬ 
vectors V -1 and V of the ID operator. More precisely, the total number of 
operations Op for computing pijk- s is 

Op = Nr x [6 x (n + 3) 4 + 15 x (n + 3) 3 ] . 

Geometric factors are not loaded by this kernel, which implies that the 
total amount of data that is transferred is reduced. Vectors Zijk-b, r ijk-.b as 
well as the scattering table If {ijk\ b) are the three large local vectors that 
have to be transferred from the global memory to the shared memory. 

Bp = 4 x Nr x [3 x (n + 3) 3 + 4 x (n + 3) 2 ] . 

The ratio Op/Bp is already large at low orders to expect good performances 
of the kernel for any n. 

5.3. A strategy for computing the preconditioning step 

We have shown that the most intensive part of the computation is the 
evaluation of the local preconditioner Zijk- S - In an additive Schwartz proce¬ 
dure, coarse and fine preconditioners can be computed independently and 
summed afterwards. Here, we propose to compute the coarse scale precondi¬ 
tioner on the CPU while computing the fine scale part on the GPU. At low 
orders, we expect that the computation of the coarse scale preconditioner on 
the CPU will hide the computation of the fine scale preconditioner on the 
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Mesh 1 Mesh 2 Mesh3 


Figure 3: Three meshes. Left: cube domain meshed with uniform elements. Center: 
nearly uniform elements on distorted cube domain. Right: strongly distorted elements 
forming mesh for distorted cube. 


GPU. At higher order, the fine scale preconditioner will be more expensive 
and will hide the computation of the coarse scale part. There may exist a 
sweet spot where both are taking equivalent time. 

6. Results 

6.1. Poisson problem 

We choose the parameters in out model problem (|2j) in order to turn it 
into a Poisson equation. We choose k = 1, c = 0 and s = 1. 

6.1.1. Analysis of the two scale strategy 

In this section, we investigate the proposed numerical scheme. Three 
coarse meshes with Ne = 8 3 hexes are considered (see Figure [ 3 ]) and uni¬ 
formly refined up to JV® = 128 3 . A polynomial order n = 3 is used for 
the computations. The strategy that has been chosen for preconditioning 
is h-optimal: the number of CG iterations is asymptotically stable for all 
meshes. If only the fine scale preconditioner is used, the number of itera¬ 
tions doubles when the mesh size is divided by two. It must be noted that 
the computation of V c (on the CPU) is fully hidden by the computation of 
■pf 

(on the GPU) for n > 2. The coarse scale preconditioner has a zero cost 
for a large benefit. 

Note that the number of CG iterations increases with polynomial order: 
for mesh 2 at level of refinement 32 3 and for n = 7, the number of CG 
iterations grows to 23. This is due to the fact that the size of the overlap 
decreases while n grows. A larger overlap would fix that issue |21| . 
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DOF count 

8 3 

16 3 

32 3 

64 3 

128 3 

N 

1.5 10 4 1.1 10 5 

9.1 10 5 

7.1 10 6 

5.7 10 7 



Two scale strategy: V 

= T C + Tt 

Mesh 1 

10 

11 

13 

13 

13 

Mesh 2 

10 

13 

15 

15 

15 

Mesh 3 

12 

18 

21 

21 

21 


Fine scale preconditioner only V 

= v t 

Mesh 1 

10 

16 

28 

52 

103 

Mesh 2 

13 

20 

32 

60 

118 

Mesh 3 

14 

23 

41 

94 

185 


Table 1: Number of Conjugate Gradient iterations as a function of the preconditioning 
strategy. 


6.1.2. Performances of the kernels 

We consider here the 32 3 version of Mesh 2 (Figure [ 3 ]) . Table [ 2 ] present 
flops counts and bandwidth results for different polynomial orders for a 
uniform mesh of 32 3 elements. Quantities with a prime refer to the kernels 
for which the geometric factors were computed on the fly. In Table [2j two 
numbers are given in the row relative to rh fc . e ’s flop count. The first one 
does not take into account the operations that are required to compute the 
geometry factors while the second number adds to the flop count Ne x 
[242 x (n + l) 3 ] floating point operations which correspond to the on the 
fly computation of the 7 geometric factors. 

It is interesting to see that the difference in wall clock time between the 
two approaches is below measurement error. Computing geometric factors 
in an element is relatively floating point intensive: it consist essentially of 
loading the 24 coordinates of the eight vertices on registers and perform¬ 
ing 242 floating point operations required to compute the geometric factors 
, 1 < m < 6 and m^k-e that are themselves stored on registers. 
The specific GPU that has been used has a very high peak floating point 
performance (over five teraflops in single precision). Computing geometric 
factors is done a a rate that is close to the peak performance of the machine 
while computing the rest of the kernel r' i j k . e requires memory access that 
are bounded by bandwidth. In the case of kernel r^e, loading geometric 
factors requires bandwidth: the GPU stalls quickly at about 150 GB/sec 
and so does the flop count. If geometric factors are taken into account in 
the flop count, the performances of this kernel climb to over one teraflop! 

We have done the same computations on a different GPU (see Table [3] ) 
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n = 2 

n = 3 

n = 4 

n = 5 

n = 6 

n = 7 

N 

2.7 10 5 

9.1 10 5 

2.1 10 B 

4.1 10 B 

7.1 10 B 

1.1 10 7 

r ijk . e (GFLOPs) 

70 

145 

227 

289 

325 

362 

r ijk-,e (GFLOPs) 

75/336 

135/763 

250/1050 

272/1020 

286/1050 

293/1050 

z ijk . s (GFLOPs) 

370 

371 

408 

453 

421 

373 

rijk-e (GB/sec) 

59 

101 

135 

150 

150 

150 

r 'ijk;e (GB/sec) 

32 

48 

75 

71 

66 

61 

Zij k ;S (CrB/sec) 

96 

80 

76 

74 

61 

49 

Wall Time (sec) 

0.10 

0.12 

0.23 

0.34 

0.53 

0.83 

Wall Time’ (sec) 

0.11 

0.12 

0.22 

0.34 

0.52 

0.80 


Table 2: Performances of the two expensive kernels Vijk;e and Zijk-,s on a NVIDIA GTX 
980 GPU using OpenCL. A uniform mesh of 32 3 hexahedra was used for the test. Apos¬ 
trophes indicate that geometric factors were computed on the fly. In row relative to r' j7 .. e , 
the two numbers correspond whether the cost cof computing geometrical factors was not 
or was taken into account in Or. 


with a lower throughput. Kernel r[- k . e is slightly slower than kernel rijk-e- 
6.2. Heat equation 

We choose the parameters in out model problem ([2]) in order to turn it 
into a heat equation. A backward Euler scheme is used to advance in time. 
We choose n = 10“ 2 [m 2 /sec] (aluminium), a time step of At = 0.04 seconds 
and c = 1/At. The source term in ([2]) is chosen as r = u^/ At /At + Q/(pc p ) 
where w^/ At is the solution at previous time step, Q = 1000 [W] is a volume 
heat source p = 7000 [kg/m 3 ] and cp = 0.8. The volumetric heat source 
Q(x,y,z ) is moved while time is advancing (see Figure [4]). 

The geometry and the mesh that we consider are presented in Figure |4j 
We solved 70 time steps of the discretized heat equation on this mesh us¬ 
ing different devices (GPU and CPU) and using different threading systems 
(CUDA, OpenCL, and OpenMP). All computations are performed in single 
precision arithmetic, either on an NVIDIA © GTX 980 GPU or on a 8 
core Intel © Core™ i7-5960X CPU @ 3.00GHz that has a theoretical peak 
performance of about 300 GFLOPS. Results are compiled in Table |4j Geo¬ 
metrical factors were always computed on the fly and flop count takes into 
account their computation. Thanks to OCCA, the computations were done 
both on the GPU and on the CPU using common computational kernels. 
CPU to GPU speedups of around 10 were observed on all computations: 
this is about the ratio of memory bandwidths between GPU (200 GB/sec) 
and the CPU (20 GB/sec). Higher speedups were obtained at higher or- 
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n = 2 

n = 3 

n = 4 

n = 5 

n = 6 

n = 7 

N 

2.7 10 5 

9.1 10 5 

2.1 10 6 

4.1 10 e 

7.1 10 e 

1.1 10 7 

rijk- e (GFLOPs) 

45 

117 

194 

234 

227 

287 

<jk-e (GFLOPs) 

28 

68 

91 

106 

105 

112 

Zijk-A GFLOPs) 

313 

296 

285 

380 

276 

371 

rijk-e (GB/sec) 

38 

82 

115 

121 

104 

118 

r ijk-,e (GB/sec) 

12 

24 

27 

28 

24 

23 

Zijk-s (GB/sec) 

81 

64 

53 

62 

40 

48 

Wall Time (sec) 

0.10 

0.13 

0.26 

0.39 

0.67 

0.92 

Wall Time’ (sec) 

0.11 

0.14 

0.29 

0.46 

0.79 

1.13 


Table 3: Performances of the two dominant cost kernels r^k-e and Zijk-, s on a NVIDIA K40 
GPU using CUDA. A uniform mesh of 32 3 hexahedra was used for the test. Apostrophes 
indicate that geometric factors were computed on the fly. 



Figure 4: Rod geometry considered for the heat equation with the hex-mesh of 62540 
hexahedra.. Temperature field is shown at iteration 70 which correspond to t = 2.8 
seconds. The motion of the heat source is visible on the plot. Mesh is visible on the right 
part of the plot 
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n = 2 

n = 3 

n = 4 

n = 5 

n = 6 

n = 7 

device mem.(MB) 

178 

370 

587 

1147 

1796 

2660 

N 

5.3 10 5 

1.7 10 6 

4.1 10 6 

8.0 10 6 13.8 10 6 

21.9 10 6 


Total wall time (sec) for 70 time steps 

CUDA(GPU) 

13.4 

14.8 

32.1 

53.6 

90.3 

134.9 

OpenCL(GPU) 

12.2 

13.9 

31.8 

49.9 

83.0 

126.3 

OpenMP(16 threads) 

100.7 

127.4 

355.1 

509.9 

901.5 

1393.5 

OpenCL(16 threads) 

102.7 

156.2 

384.2 

479.7 

1052.1 

1168.5 


Performance in GFLOPs of r' ijk;e 

CUDA(GPU) 

674.0 

923.0 

926.0 

874.0 

806.0 

795.0 

OpenCL(GPU) 

1140.0 

1660.0 

1690.0 

1390.0 

1380.0 

1310.0 

OpenMP(16 threads) 

24.5 

68.8 

59.3 

58.5 

57.8 

90.0 

OpenCL(16 threads) 

30.2 

39.3 

42.4 

42.9 

43.0 

206.0 


Performance in GFLOPs of Zijk- e 

CUDA(GPU) 

479.0 

533.0 

492.0 

535.0 

479.0 

591.0 

OpenCL(GPU) 

422.0 

385.0 

457.0 

470.0 

445.0 

478.0 

OpenMP(16 threads) 

27.9 

40.4 

38.9 

63.9 

49.4 

51.5 

OpenCL(16 threads) 

23.7 

26.6 

27.4 

82.6 

28.8 

28.5 


Bandwidth 

in GB/sec for r' ijk . e 



CUDA(GPU) 

56.9 

74.0 

70.9 

64.2 

56.9 

54.1 

OpenCL(GPU) 

96.1 

133.0 

129.0 

102.0 

97.7 

89.9 

OpenMP(16 threads) 

2.1 

5.5 

4.5 

4.3 

4.1 

6.1 

OpenCL(16 threads) 

2.5 

3.1 

3.2 

3.1 

3.0 

14.0 


Bandwidth 

in GB/sec for Zijk- e 



CUDA(GPU) 

125.0 

116.0 

91.4 

87.2 

69.5 

77.2 

OpenCL(GPU) 

110.0 

83.4 

84.9 

76.6.0 

64.6 

62.5 

OpenMP(16 threads) 

7.2 

8.7 

7.2 

10.4 

7.2 

6.7 

OpenCL(16 threads) 

6.2 

5.8 

5.1 

13.5 

4.2 

3.7 


Table 4: Overall timings and performances for different devices and threading systems. 
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ders thanks to the more computationally intensive nature of higher order 
computations. 

Wall clock times essentially depend on the device: CUDA and OpenCL 
give similar performances on the GPU. On the CPU, both OpenMP and 
OpenCL were run on 16 threads using the hyper-threading capacities of our 
8-core processor. 

Global performance of the code is similar on the GPU, whether OpenCL 
or CUDA is used as threading system. OpenCL is indeed slightly more 
efficient, especially for computing the residual. 

On the CPU, OpenMP outperforms OpenCL, except for n = 5 and 
n = 7 where either r'^ k . e (n = 7) or Zijk ;e ( n = 5) shows up extraordinary 
performances with OpenCL. At those respective orders, both the kernels act 
on array of size 8x8x8. This array size being an exact fraction of a cache 
line, memory bandwidth is essentially doubled, reaching about 14 GB/sec. 
Note that padding our arrays to a size that is a power of two could certainly 
improve performances at all orders. OpenCL seems to be better at loop 
vectorization as well, both on CPU and on GPU. 

7. Conclusions 

In this paper, we have shown that implicit time-steps with over 20 mil¬ 
lions of degrees of freedom in less than half of a second was possible on an 
off-the-shelf $500 GPU. One of the important features of the method is the 
two-scale preconditioner with its two parts that can be computed simulta¬ 
neously on the host and the device. 

Another important aspect that we have pointed out, and that signifi¬ 
cantly differs from the usual way of dealing with finite element codes, is 
that it is more interesting to re-compute quantities like geometrical factors 
on the fly than to store them in global memory. This is essentially because 
GPU computing is more memory bound when compared to CPU computing 
when considering bandwidth available to each processing element. There¬ 
fore, it is more important to reduce the amount of data that is transferred 
from the global memory on a GPU than on a CPU. 

Finally, the use of the common kernel language OCCA allows us to 
change the device and the thread model in a very simple fashion i.e. using 
the same code for all platforms. The OCCA kernel language is essentially 
a “C” language with extra decorations, which makes it very readable to 
computational scientists. 

One order of magnitude speedups is observed between the code running 
on the CPU and on the GPU. In both cases, we are able to use about 10% 
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of the peak performance of the device, which is to our best knowledge, close 
to the best one can get. 

This paper is essentially about a fast Poisson solver. On the one hand 
the Poisson equation solver is a rather simple proxy application, but on the 
other hand a Poisson solver is the building block of more complex models 
such as incompressible Navier-Stokes equations or acoustics equations. In 
further work, non-conforming meshes as well as incompressible fluids will be 
considered. 
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